Abstract

We investigate the asymptotic standard deviation of the Longest Common
Subsequence (LCS) of two independent i.i.d. sequences of length n. The first sequence
is drawn from a three-letter alphabet {0, 1, a}, whilst the second sequence is binary.
The main result of this article is that in this asymmetric case, the standard
deviation of the length of the LCS is of order $\sqrt{n}$. This confirms Waterman's conjecture
[22] for this special case. Our result seems to indicate that in many other situations
the order of the standard deviation is also $\sqrt{n}$.



1 Introduction 

In computational genetics and computational linguistics one of the basic problems
is to find an optimal alignment between two given sequences $X := X_1 \ldots X_n$ and
$Y := Y_1 \ldots Y_n$. This requires a scoring system which can rank the alignments.
Typically a substitution matrix gives the score for each possible pair of letters. The 
total score of an alignment is the sum of terms for each aligned pair of residues, 
plus a usually negative term for each gap (gap penalty). 

Let us look at an example. Take the sequences X and Y to be binary sequences. Let the 
substitution matrix be equal to:

            0   1
        0   2   1
        1   1   3

With the above matrix we get the following scores for pairs of letters:
s(0, 0) = 2, s(0, 1) = s(1, 0) = 1, s(1, 1) = 3.
(Here, s(a, b) designates the score when we align letter a with letter b.) Take X = 0101
and Y = 1100 with the above substitution matrix and a zero gap penalty. The optimal
alignment is:

0101..
.1.100

The above alignment gives the score s(1, 1) + s(1, 1) = 3 + 3 = 6. This is the alignment with
maximal score. We denote by $L_n$ the maximal alignment score. In our example $L_n = 6$.

Let $\{X_i\}_{i \in \mathbb{N}}$ and $\{Y_i\}_{i \in \mathbb{N}}$ be two ergodic processes independent of each other. Let
$L_n$ denote the optimal alignment score of the two finite sequences $X := X_1 \ldots X_n$
and $Y := Y_1 \ldots Y_n$. Waterman [22] conjectured that in many cases $\mathrm{VAR}[L_n]$ is of
order n.

Throughout this paper the substitution matrix is equal to the identity and there 
is no gap penalty. In this case, the optimal score is equal to the length of the Longest 
Common Subsequence (LCS) of X and Y. (A common subsequence of X and Y is 
a sequence which is a subsequence of X as well as of Y.) 

We take the two sequences X and Y to be i.i.d. sequences. The letters of X are
drawn from the three-letter alphabet {0, 1, a} and Y is a binary sequence. The main
result of this article is that $\mathrm{VAR}[L_n]$ is of order n.

Let X = a11a1000 and Y = 00110011. We remove the a's from X and obtain the binary
sequence $X^{01}$ = 111000. The length of the LCS of X and Y is equal to the length of the
LCS of $X^{01}$ and Y, since no a's appear in Y. In this example an LCS is 1100; it corresponds
to the following alignment of $X^{01}$ (top row) and Y (bottom row):

_ _ 1 1 1 0 0 0 _ _
0 0 1 1 _ 0 0 _ 1 1

(The letters which appear stacked one on top of the other correspond to the letters of the
common subsequence 1100.)
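
The reduction from X to its binary part can be checked numerically. The following is a minimal Python sketch using the classical dynamic program for the LCS length; the function name `lcs_length` is illustrative and does not come from the article.

```python
def lcs_length(x: str, y: str) -> int:
    """Length of the longest common subsequence of x and y (row-by-row DP)."""
    prev = [0] * (len(y) + 1)
    for cx in x:
        cur = [0]
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


x = "a11a1000"             # sequence over {0, 1, a}
y = "00110011"             # binary sequence
x01 = x.replace("a", "")   # remove the a's, which can never be matched

print(x01, lcs_length(x, y), lcs_length(x01, y))
```

Both calls return the same value, because a letter that never occurs in Y cannot take part in any common subsequence.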

The reader might wonder why the case considered in the present article is
relevant. Three letters in one sequence and two in the other might seem an unrealistic
example. Our motivation is the following: in any i.i.d. sequence there are finite
patterns (i.e. finite words) which tend to have below-average expected matching
scores. The number of times any given finite pattern occurs in $X = X_1 \ldots X_n$ is
roughly a binomial variable with variance proportional to n. Hence, the number of
times we observe a given pattern in X behaves roughly like the number of a's in
X. The number of a's in X decreases the optimal score linearly. For a given finite
pattern with low average matching score the same should be true.
Bonetto and Matzinger [?] simulated the situation where both sequences X and
Y are binary i.i.d. When the zeros and ones are equally likely (and for n around
10000), surprisingly the standard deviation of $L_n$ is of order $n^{1/3}$. This is similar
to the behavior of the Longest Increasing Subsequence (LIS) as studied by
Baik-Deift-Johansson [?] and Aldous-Diaconis [?]. On the other hand, Durringer, Lember and
Matzinger [?] investigate the case where the sequence Y is periodic with a short
period. They also find the standard deviation of $L_n$ to be of order $\sqrt{n}$. Hence a
small change in the parameters of the model can change the asymptotic behaviour
of the random variable $L_n$ completely.

Let us mention a little bit of the history of these problems:

Using a subadditivity argument, Chvatal-Sankoff prove that the limit

$$\gamma := \lim_{n \to \infty} E[L_n]/n$$

exists. The exact value of $\gamma$ remains however unknown. Chvatal-Sankoff derive
upper and lower bounds for $\gamma$, and similar upper bounds were found by Baeza-Yates,
Gavalda, Navarro and Scheihing [?] using an entropy argument. These bounds have
been improved by Deken [?], and subsequently by Dancik-Paterson [?]. In [16],
Hauser, Martinez and Matzinger developed a Monte Carlo and large deviation-based
method which allows one to further improve the upper bounds on $\gamma$. Their approach
can be seen as a generalization of the method of Dancik-Paterson.
For sequences with many letters, Kiwi, Loebl and Matousek [17] have the following
interesting result:

when both sequences X and Y are drawn from the alphabet {1, 2, ..., k} and the
letters are equiprobable, then $\gamma \sqrt{k} \to 2$ as $k \to \infty$ (that is, $\gamma \approx 2/\sqrt{k}$ for large k).

Waterman-Arratia [7] derive a law of large deviation for $L_n$ for fluctuations on scales
larger than $\sqrt{n}$. The order of magnitude of the deviation from the mean of $L_n$ is
unknown, and in fact it is not even known if these deviations are larger than a power
of n. However, using first passage percolation methods, Alexander [2] proves that
$E[L_n]/n$ converges at a rate of order at least $\sqrt{\log n / n}$.

Waterman [22] studies the statistical significance of the results produced by
sequence alignment methods. An important problem that was open for decades
concerns the longest increasing subsequence (LIS) of random permutations and
appears to be related to the LCS-problem. However, it is an open question whether
solutions of the LIS-problem can be used to study the LCS-problem, see
Baik-Deift-Johansson [?] and Aldous-Diaconis [?].

Another problem related to the LCS-problem is that of comparing sequences X
and Y by looking for longest common words that appear both in X and Y, and
generalizations of this problem where the word does not need to appear in exactly
the same form in the two sequences. The distributions that appear in this context
have been studied by Arratia-Gordon-Goldstein-Waterman [3] and Neuhauser [15].
A crucial role is played by the Chen-Stein method for the Poisson approximation.
Arratia-Gordon-Waterman [?] shed some light on the relation between the
Erdos-Renyi law for random coin tossing and the above mentioned problem. In [?] the
same authors also developed an extreme value theory for this problem.

For a general discussion of the relevance of string comparison for biology and of
other similar problems in computational biology the reader can refer to the standard
texts [?].

2 Main result 

Throughout this paper $\{X_i\}_{i \in \mathbb{N}}$ and $\{Y_i\}_{i \in \mathbb{N}}$ are two i.i.d. sequences which are
independent of each other and which satisfy all of the following three conditions:

1. The variables $X_i$, $i \in \mathbb{N}$, have state space {0, 1, a}.

2. There exists p, 0 < p < 1, such that

$$P(X_1 = a) = p, \qquad P(X_1 = 0) = P(X_1 = 1) = \frac{1 - p}{2}. \tag{2.1}$$

3. The variables $Y_i$, $i \in \mathbb{N}$, are Bernoulli variables with parameter 1/2.

When all three conditions above are satisfied, we say we are in case I. The main
result of this paper is:

Theorem 2.1 When we are in case I, there exists k > 0 not depending on n, such
that for all $n \in \mathbb{N}$, we have

$$\mathrm{VAR}[L_n] \geq k \cdot n. \tag{2.2}$$
There is also an upper bound for the variance

$$\mathrm{VAR}[L_n] \leq K \cdot n$$

where K > 0 is a constant not depending on n. This upper bound follows directly
from the large deviation result for the LCS of Waterman-Arratia [7]. Let us give this
result:

Lemma 2.1 Assume that we are in case I. Then
there exists a constant c > 0 (not depending on n and $\Delta$) such that for all n large
enough and all $\Delta > 0$, we have that:

$$P(|L_n - E[L_n]| \geq n\Delta) \leq e^{-c n \Delta^2} \tag{2.3}$$
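
The step from the large deviation bound to the linear upper bound on the variance can be sketched as follows. This is only an illustrative derivation, under the simplifying assumption that the bound of Lemma 2.1 is available for the n considered; the substitution $t = n\Delta$ is ours.

```latex
% With t = n\Delta the bound of Lemma 2.1 reads
%   P(|L_n - E[L_n]| \ge t) \le e^{-c t^2 / n}.
% Writing the second moment as an integral of the tail probability gives
\begin{align*}
\mathrm{VAR}[L_n]
  &= \int_0^\infty 2t \, P\bigl(|L_n - E[L_n]| \ge t\bigr)\, dt \\
  &\le \int_0^\infty 2t \, e^{-c t^2 / n}\, dt
   = \frac{n}{c}.
\end{align*}
```

Under this assumption one may therefore take K = 1/c.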

Theorem 2.1 and Lemma 2.1 together imply that the typical size of $L_n - E[L_n]$ is
of order $\sqrt{n}$. More precisely, let $D_n := (L_n - E[L_n])/\sqrt{n}$ denote the rescaled fluctuation
of $L_n$. Then:

Theorem 2.2 The sequence $\{D_n\}$ is tight. Moreover, the limit of any weakly
convergent subsequence of $\{D_n\}$ is not a Dirac measure.

Theorem 2.2 is a rather direct consequence of Theorem 2.1 and Lemma 2.1. We refer
the reader to [?] for the proof.



3 Proof of main theorem 

Let $N_a$ designate the number of a's in the sequence $X = X_1 X_2 \ldots X_n$. Let $X^{01}$
designate the subsequence of X consisting of all the 0's and 1's contained in X.
In other words, $X^{01}$ is obtained by removing the a's from the finite sequence X.
Thus, $X^{01}$ is a finite sequence of i.i.d. Bernoulli variables with parameter 1/2 with
random length. The length of the random binary string $X^{01}$ is equal to $n - N_a$.

Let us illustrate this with a practical example. For n = 6, assume that X = 011a0a and
Y = 101011. In this case $N_a = 2$ and $X^{01}$ = 0110. Obviously the a's from sequence X
cannot be matched since Y does not contain any a's. Hence, the length $L_6$ of the LCS of
X and Y is equal to the length of the LCS of $X^{01}$ and Y. The length of the LCS is
$L_6 = 3$. There are actually three longest common subsequences: 011, 010 and 110.
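
The list of longest common subsequences in this example can be reproduced by brute force. The sketch below is an illustration only: the helper name `all_lcs` is hypothetical, and the memoised recursion is a standard textbook technique rather than anything used in the proofs.

```python
from functools import lru_cache


def all_lcs(x: str, y: str) -> set[str]:
    """Return the set of all distinct longest common subsequences of x and y."""

    @lru_cache(maxsize=None)
    def solve(i: int, j: int) -> frozenset:
        # All LCS strings of the prefixes x[:i] and y[:j].
        if i == 0 or j == 0:
            return frozenset({""})
        if x[i - 1] == y[j - 1]:
            # When the last letters agree, every LCS ends with that letter.
            return frozenset(s + x[i - 1] for s in solve(i - 1, j - 1))
        left, up = solve(i, j - 1), solve(i - 1, j)
        len_left = len(next(iter(left)))
        len_up = len(next(iter(up)))
        if len_left > len_up:
            return left
        if len_up > len_left:
            return up
        return left | up

    return set(solve(len(x), len(y)))


x01 = "011a0a".replace("a", "")
print(sorted(all_lcs(x01, "101011")))
```

The recursion is exponential in the worst case, which is harmless for strings of this size.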

The main idea why $L_n$ fluctuates on the scale $\sqrt{n}$ is the following: The binomial
variable $N_a$ has variance of order n. The variable $L_n$ tends to decrease linearly
with an increase of $N_a$ (since the a's are not matched and thus constitute losses).
Hence $L_n$ should also fluctuate on the scale $\sqrt{n}$.
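
This heuristic can be explored with a small Monte Carlo experiment. The sketch below is not part of the proof: the parameters n = 100, p = 0.3 and the number of trials are arbitrary illustrative choices, and the helper names are hypothetical.

```python
import random


def lcs_length(x, y):
    """Length of the longest common subsequence of x and y (row-by-row DP)."""
    prev = [0] * (len(y) + 1)
    for cx in x:
        cur = [0]
        for j, cy in enumerate(y, start=1):
            cur.append(prev[j - 1] + 1 if cx == cy else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def sample_case_one(n, p, rng):
    """Draw (X, Y) as in case I: P(X_i = a) = p, all other letters are fair bits."""
    x = "".join("a" if rng.random() < p else rng.choice("01") for _ in range(n))
    y = "".join(rng.choice("01") for _ in range(n))
    return x, y


def simulate(n, p, trials, seed=0):
    """Return a list of (N_a, L_n) pairs from independent draws."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(trials):
        x, y = sample_case_one(n, p, rng)
        pairs.append((x.count("a"), lcs_length(x, y)))
    return pairs


def sample_variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


n, p = 100, 0.3                      # illustrative parameters only
pairs = simulate(n, p, trials=200)
var_na = sample_variance([na for na, _ in pairs])
var_ln = sample_variance([ln for _, ln in pairs])
print(f"VAR[N_a]/n ~ {var_na / n:.3f}, VAR[L_n]/n ~ {var_ln / n:.3f}")
```

Such a simulation can only suggest the order of magnitude for moderate n; it says nothing about the asymptotic regime treated by the theorem.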

To prove this rigorously, we simulate the variable $L_n$ in a special way. We first
simulate a variable with the same distribution as $N_a$. (We can call it $N_a$.) Then we
generate $X^{01}$ by using a drop-scheme of random bits. Instead of flipping a coin
independently $n - N_a$ times in a row we generate a sequence $Z^1, Z^2, \ldots$ of binary
strings where $Z^k$ has length k. $Z^{k+1}$ is obtained by adding to $Z^k$ a random bit at
a random location.

For example, assume that we have the binary string $Z^5$ = 00010. There are four possible
positions where the next bit could come:

    position 1    position 2    position 3    position 4
    0x0010        00x010        000x10        0001x0

where x designates the possible position of the next bit. We assign the same probability
to each of the four above possibilities and draw one of them at random. We flip a fair
coin, and fill the previously chosen position with the number obtained from the fair coin.
If the position chosen is the second one and the fair coin gives us a 1, then we obtain
$Z^6$ = 001010.

We apply this scheme recursively on k and obtain a sequence of random binary
strings $Z^1, Z^2, \ldots, Z^n$. Let $Z^k_i$ designate the i-th bit of the k-th string. With that
notation:

$$Z^k = Z^k_1 Z^k_2 \ldots Z^k_k.$$

Hence, $\{Z^k_i\}_{i \leq k \leq n}$ is a triangular array of Bernoulli variables. Let us next define the
$Z^k_i$'s in a formal way: let $V_k$, $k \in \mathbb{N}$, be a sequence of i.i.d. Bernoulli variables with
parameter 1/2. Let $T_k$, $k \in \mathbb{N}$, be a sequence of independent integer variables, so that
$\{V_k\}_{k \in \mathbb{N}}$ is independent of $\{T_k\}_{k \in \mathbb{N}}$. Furthermore, for $k \in \mathbb{N}$, let the distribution of
$T_{k+1}$ be the uniform distribution on the set {2, ..., k} (i.e. for all $s \in \{2, \ldots, k\}$,
we have that $P(T_{k+1} = s) = 1/(k - 1)$). We define $Z^k$ recursively in k:

• Let $Z^2 := V_1 V_2$.

• Given the binary string $Z^k = Z^k_1 Z^k_2 \ldots Z^k_k$, we define $Z^{k+1}$:

  - For all $j < T_{k+1}$, let $Z^{k+1}_j := Z^k_j$.

  - For $j = T_{k+1}$, let $Z^{k+1}_j := V_{k+1}$.

  - For j such that $T_{k+1} < j \leq k + 1$, let $Z^{k+1}_j := Z^k_{j-1}$.

(Thus $V_k$ designates the k-th bit added and $T_k$ designates the position where it gets
added.)
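
The recursive definition translates almost literally into code. The following Python sketch is one possible reading of the scheme, starting from a string of two random bits and inserting each new bit at an interior position; the function name `bit_drop_strings` is hypothetical.

```python
import random


def bit_drop_strings(n: int, rng: random.Random) -> list[str]:
    """Return [Z^2, ..., Z^n] built by the bit-drop scheme (requires n >= 2)."""
    z = [rng.randint(0, 1), rng.randint(0, 1)]   # Z^2 := V_1 V_2
    strings = ["".join(map(str, z))]
    for k in range(2, n):
        t = rng.randint(2, k)    # T_{k+1}, uniform on {2, ..., k}
        v = rng.randint(0, 1)    # V_{k+1}, a fair bit
        z.insert(t - 1, v)       # the new bit becomes the t-th bit (1-based)
        strings.append("".join(map(str, z)))
    return strings


for s in bit_drop_strings(6, random.Random(0)):
    print(s)
```

Since new bits only land at interior positions, the first and last bits of every string in the sequence are the two initial bits.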
To prove the main result of this paper, we generate a variable having the same
distribution as $L_n$ using the bit-drop scheme. Instead of generating the sequence X,
we generate the triangular array $\{Z^k_i\}_{i \leq k \leq n}$ and, independently, a random number
$N_a$ with binomial distribution with parameters p and n. Then, we look for the
longest common subsequence of Y and $Z^k$ with $k = n - N_a$.

More precisely, let $L^a_n(k)$ designate the length of the Longest Common Subsequence
of $Z^k$ and $Y = Y_1 Y_2 \ldots Y_n$. Then:

Lemma 3.1 Assume that case I holds and $Z^k$ is generated independently of Y and
$N_a$, according to the mechanism described above. Then, $L_n$ has the same distribution
as $L^a_n(n - N_a)$.

Proof. For every $l, k \geq 0$ we have that $P(L_n = l \mid N_a = k) = P(L^a_n(n - k) = l)$.
This gives the thesis. ■

We can now explain the main idea behind the proof of Theorem 2.1: assume
f is a map with slope bounded below, so that $f'(x) \geq c > 0$ for all $x \in \mathbb{R}$. Let B be
any random variable. Lemma 3.2 tells us that in this case, the variance of f(B) is
bounded below by $c^2 \cdot \mathrm{VAR}[B]$. On the other hand, the map $k \mapsto L^a_n(k)$ is very likely
to increase at a linear rate larger than a constant $k_1 > 0$. Hence $\mathrm{VAR}[L_n] =
\mathrm{VAR}[L^a_n(n - N_a)]$ should be larger than $k_1^2 \, \mathrm{VAR}[N_a]$. The most difficult part in the
proof is showing that with high probability the slope of $k \mapsto L^a_n(k)$ is "everywhere"
bounded below by a positive constant. This problem is solved in the next section.
Let us look at the details of the proof of Theorem 2.1.

Lemma 3.2 Let c > 0. Assume that $f : \mathbb{R} \to \mathbb{R}$ is a map which is everywhere
differentiable and such that for all $x \in \mathbb{R}$:

$$f'(x) \geq c. \tag{3.1}$$

Let B be a random variable such that $E[|f(B)|] < +\infty$. Then:

$$\mathrm{VAR}[f(B)] \geq c^2 \cdot \mathrm{VAR}[B]. \tag{3.2}$$

Proof. We have that E[B] and E[f(B)] are finite. Observe that $\lim_{x \to \pm\infty} f(x) =
\pm\infty$ and f(x) is strictly increasing, so that there exists $x_0 \in \mathbb{R}$ such that

$$f(x_0) = E[f(B)]. \tag{3.3}$$

By the mean value theorem, we know that there exists a map $\delta : \mathbb{R} \to \mathbb{R}$ such
that for all $x \in \mathbb{R}$ we have

$$f(x) = f(x_0) + f'(\delta(x))(x - x_0). \tag{3.4}$$

By definition of variance and eqs. (3.3), (3.4) we have:

$$\mathrm{VAR}[f(B)] = E[(f(B) - f(x_0))^2] = E[f'(\delta(B))^2 (B - x_0)^2]. \tag{3.5}$$

Using eq. (3.1) we get:

$$\mathrm{VAR}[f(B)] \geq c^2 E[(B - x_0)^2]. \tag{3.6}$$

Observe that

$$E[(B - x_0)^2] \geq \min_y E[(B - y)^2] = \mathrm{VAR}[B] \tag{3.7}$$

where we used a well known minimizing property of the variance. This immediately
gives

$$\mathrm{VAR}[f(B)] \geq c^2 \, \mathrm{VAR}[B] \tag{3.8}$$

which finishes this proof. ■
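
A quick numerical sanity check of this variance inequality can be run on a concrete example. In the sketch below the map f(x) = 2x + sin(x) and the Binomial(20, 0.3) law for B are arbitrary illustrative choices, not taken from the article.

```python
import math


def variance(values, probs):
    """Variance of a discrete random variable with the given values and weights."""
    mean = sum(v * p for v, p in zip(values, probs))
    return sum(p * (v - mean) ** 2 for v, p in zip(values, probs))


def f(x):
    # f'(x) = 2 + cos(x) >= 1, so the slope condition holds with c = 1.
    return 2 * x + math.sin(x)


c = 1.0
n, q = 20, 0.3
support = list(range(n + 1))
probs = [math.comb(n, k) * q**k * (1 - q) ** (n - k) for k in support]

var_b = variance(support, probs)
var_fb = variance([f(b) for b in support], probs)
print(var_b, var_fb, var_fb >= c**2 * var_b)
```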

Typically, the (random) map $k \mapsto L^a(k)$ does not strictly increase for every $k \in
[0, n]$. But it is likely that every order $\ln n$ points, it increases by a linear quantity.
Next we define an event which guarantees that the map $k \mapsto L^a(k)$ increases linearly
on the scale $\ln n$:

Definition 3.1 Let $E^n_{\mathrm{slope}}$ designate the event that for all i, j such that $0 \leq i < j \leq n$
and $i + k_2 \ln n \leq j$, we have:

$$L^a(j) - L^a(i) \geq k_1 |i - j|. \tag{3.9}$$

Here $k_1, k_2 > 0$ designate constants which do not depend on n and which will be
fixed in the proofs in sects. 4, 5.

The above definition gives the discrete equivalent of condition (3.1) in the case of a
discrete function. Before proceeding we need a discrete version of Lemma 3.2.

Lemma 3.3 Let c, m > 0 be two constants. Let $f : \mathbb{Z} \to \mathbb{Z}$ be a non-decreasing
map such that:

• for all i < j:

$$f(j) - f(i) \leq (j - i) \tag{3.10}$$

• for all i, j such that $i + m \leq j$:

$$f(j) - f(i) \geq c \cdot (j - i). \tag{3.11}$$

Let B be an integer random variable such that $E[|f(B)|] < +\infty$. Then:

$$\mathrm{VAR}[f(B)] \geq c^2 \left( 1 - \frac{2m}{c \sqrt{\mathrm{VAR}[B]}} \right) \mathrm{VAR}[B]. \tag{3.12}$$

Proof. Because of conditions (3.10) and (3.11), we can find a continuously
differentiable map $g : \mathbb{R} \to \mathbb{R}$ satisfying the following conditions:

• g agrees with f on every integer which is a multiple of m.

• for all $x \in \mathbb{R}$, we have that

$$c \leq g'(x) \leq 1. \tag{3.13}$$

Thus, we can apply Lemma 3.2 to g(B) and find:

$$\mathrm{VAR}[g(B)] \geq c^2 \cdot \mathrm{VAR}[B]. \tag{3.14}$$

The random variable g(B) approximates f(B):

$$|f(B) - g(B)| \leq (1 - c) \cdot m \tag{3.15}$$

Hence,

$$\mathrm{VAR}[f(B) - g(B)] \leq m^2 \tag{3.16}$$

Since f(B) = g(B) + (f(B) - g(B)), we can apply the triangular inequality and
find:

$$\sqrt{\mathrm{VAR}[f(B)]} \geq \sqrt{\mathrm{VAR}[g(B)]} - \sqrt{\mathrm{VAR}[f(B) - g(B)]} \tag{3.17}$$

Hence:

$$\mathrm{VAR}[f(B)] \geq \mathrm{VAR}[g(B)] - 2\sqrt{\mathrm{VAR}[g(B)]} \cdot \sqrt{\mathrm{VAR}[f(B) - g(B)]}
= \mathrm{VAR}[g(B)] \left( 1 - \frac{2\sqrt{\mathrm{VAR}[f(B) - g(B)]}}{\sqrt{\mathrm{VAR}[g(B)]}} \right)$$

Applying the inequalities (3.14) and (3.16) to the last inequality above yields

$$\mathrm{VAR}[f(B)] \geq c^2 \, \mathrm{VAR}[B] \left( 1 - \frac{2m}{c \sqrt{\mathrm{VAR}[B]}} \right) \tag{3.18}$$

which finishes this proof. ■
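
The discrete bound can also be tested on a concrete staircase map. The example below is an illustrative assumption of ours, not taken from the article: f(i) = i // 2 satisfies the two slope conditions with c = 3/8 and m = 4, and B is taken uniform on {0, ..., 999}.

```python
from fractions import Fraction


def variance(values):
    """Variance of the uniform distribution on the given list of values."""
    mean = Fraction(sum(values), len(values))
    return sum((v - mean) ** 2 for v in values) / len(values)


def f(i: int) -> int:
    return i // 2   # a non-decreasing staircase map on the integers


c, m = Fraction(3, 8), 4
support = range(1000)   # B uniform on {0, ..., 999}

# Spot-check the two slope hypotheses for all gaps j - i below 50.
for i in support:
    for j in range(i + 1, min(i + 50, 1000)):
        assert f(j) - f(i) <= j - i
        if i + m <= j:
            assert f(j) - f(i) >= c * (j - i)

var_b = variance(list(support))
var_fb = variance([f(b) for b in support])
lower = c**2 * (1 - 2 * m / (c * float(var_b) ** 0.5)) * var_b
print(float(var_b), float(var_fb), float(lower))
```

The lower bound is only informative when the standard deviation of B is large compared with m/c, which is the regime in which the lemma is applied.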

Let $\sigma_Z$ designate the $\sigma$-algebra of the triangular array $Z^k_i$ and $\sigma_{YZ}$ the $\sigma$-algebra
of the triangular array $Z^k_i$ and of the $Y_i$. Thus:

$$\sigma_Z := \sigma(Z^k_i \mid i \leq k \leq n), \qquad \sigma_{YZ} := \sigma(Z^k_i, Y_j \mid i \leq k \leq n,\ j \leq n).$$

We are now ready for the proof of the main theorem 2.1 of this article.

Proof of Theorem 2.1. By Lemma 3.1 it is enough to prove that there exists
k > 0 not depending on n, such that:

$$\mathrm{VAR}[L^a(n - N_a)] \geq kn. \tag{3.19}$$

Note that for any random variable D and any $\sigma$-field $\sigma$, we have

$$\mathrm{VAR}[D] = \mathrm{VAR}[\, E[D \mid \sigma] \,] + E[\, \mathrm{VAR}[D \mid \sigma] \,]. \tag{3.20}$$

Thus, since the variance is never negative, we find that

$$\mathrm{VAR}[D] \geq E[\, \mathrm{VAR}[D \mid \sigma] \,]. \tag{3.21}$$

Taking $L^a(n - N_a)$ for D and $\sigma_{YZ}$ for $\sigma$, we find:

$$\mathrm{VAR}[L^a(n - N_a)] \geq E[\, \mathrm{VAR}[L^a(n - N_a) \mid \sigma_{YZ}] \,] \tag{3.22}$$

Note that the map $L^a(\cdot)$ is $\sigma_{YZ}$-measurable. Thus, conditional on $\sigma_{YZ}$, $L^a(\cdot)$
becomes a non-random increasing map. The event $E^n_{\mathrm{slope}}$ is $\sigma_{YZ}$-measurable. When
$E^n_{\mathrm{slope}}$ holds, then the hypotheses of Lemma 3.3 hold for $f = L^a(\cdot)$ with $c = k_1$ and
$m = k_2 \ln n$. This implies that

$$\mathrm{VAR}[L^a(n - N_a) \mid \sigma_{YZ}] \geq (k_1)^2 \left( 1 - \frac{2 k_2 \ln n}{k_1 \sqrt{\mathrm{VAR}[N_a \mid \sigma_{YZ}]}} \right) \mathrm{VAR}[N_a \mid \sigma_{YZ}] \tag{3.23}$$

Since $N_a$ is a binomial variable with parameters p and n and is independent from
$\sigma_{YZ}$, we have that

$$\mathrm{VAR}[N_a] = \mathrm{VAR}[N_a \mid \sigma_{YZ}] = np(1 - p). \tag{3.24}$$

Using the last equality with inequality (3.23), we obtain:

$$\mathrm{VAR}[L^a(n - N_a) \mid \sigma_{YZ}] \geq np(1 - p)(k_1)^2 \left( 1 - \frac{2 k_2 \ln n}{k_1 \sqrt{p(1 - p)\, n}} \right) \tag{3.25}$$

Since $\mathrm{VAR}[L^a(n - N_a) \mid \sigma_{YZ}]$ is never negative and since inequality (3.25) holds
whenever $E^n_{\mathrm{slope}}$ holds, we find

$$\mathrm{VAR}[L_n] \geq E[\, \mathrm{VAR}[L^a(n - N_a) \mid \sigma_{YZ}] \,] \geq n \cdot P(E^n_{\mathrm{slope}}) \cdot p(1 - p)(k_1)^2 \left( 1 - \frac{2 k_2 \ln n}{k_1 \sqrt{p(1 - p)\, n}} \right) \tag{3.26}$$

We will show in Lemma 4.1 below that $P(E^n_{\mathrm{slope}}) \to 1$ as $n \to \infty$. The expression
on the right side of inequality (3.26) divided by n therefore converges to

$$p(1 - p)(k_1)^2.$$

Hence, for all n big enough, $\mathrm{VAR}[L_n]$ is larger than $np(1 - p)(k_1)^2/2 > 0$. This
finishes the proof of Theorem 2.1.



4 Slope of $L^a(\cdot)$

This section is dedicated to the proof of the following lemma:

Lemma 4.1 We have that:

$$P(E^n_{\mathrm{slope}}) \to 1 \tag{4.1}$$

as $n \to \infty$.

We first need a few definitions. A common subsequence of length m of the two
sequences $Z^k$ and Y can be viewed as a pair of strictly increasing functions $(\pi, \eta)$
such that $\pi : [1, m] \to [1, k]$, $\eta : [1, m] \to [1, n]$ and

$$\forall i \in [1, m], \quad Z^k_{\pi(i)} = Y_{\eta(i)}. \tag{4.2}$$

Definitions:

1. Let $\pi : [1, m] \to [1, k]$ and $\eta : [1, m] \to [1, n]$ be two increasing functions.
The pair $(\pi, \eta)$ is called a pair of matching subsequences of $Z^k$ and Y iff it
satisfies condition (4.2).

2. Let $M^k$ designate the set of all pairs of matching subsequences of $Z^k$ and Y.

3. Let $M^k_{\max}$ designate the set of all pairs of matching subsequences of $Z^k$ and Y
of maximal length (i.e. of maximal length in the set $M^k$).

4. Let $\leq$ indicate the natural partial order relation between increasing functions
$\pi : [1, m] \to \mathbb{N}$, i.e. $\pi_1 \leq \pi_2$ iff, for every $i \in [1, m]$, $\pi_1(i) \leq \pi_2(i)$. With a
slight abuse of notation we will indicate with $\leq$ also the partial order induced
on the pairs of increasing functions $(\pi, \eta)$, i.e. $(\pi_1, \eta_1) \leq (\pi_2, \eta_2)$ iff $\pi_1 \leq \pi_2$
and $\eta_1 \leq \eta_2$.

5. Let $M^k_{\min} \subset M^k_{\max}$ designate the set of all $(\pi, \eta) \in M^k_{\max}$ which are minimal according
to the relation $\leq$ (i.e. minimal in the set $M^k_{\max}$).

6. Let $(\pi, \eta)$ be a pair of matching subsequences of length m and let $i \in [1, m - 1]$.
We call the quadruple

$$(\pi(i), \pi(i + 1), \eta(i), \eta(i + 1)), \tag{4.3}$$

a match of $(\pi, \eta)$. If $\eta(i) + 2 \leq \eta(i + 1)$, we call the match a non-empty match.
If there exists j such that $\eta(i) < j < \eta(i + 1)$ and $Y_j = 1$, resp. $Y_j = 0$, we say
that the match contains a 1, resp. a 0. We also say that the match contains the
point j and call the bit $Y_j$ a free bit of the match $(\pi(i), \pi(i + 1), \eta(i), \eta(i + 1))$.
Sometimes we identify the match $(\pi(i), \pi(i + 1), \eta(i), \eta(i + 1))$ with the couple
of binary words:

$$\left( Z^k_{\pi(i)} Z^k_{\pi(i)+1} \ldots Z^k_{\pi(i+1)},\ Y_{\eta(i)} Y_{\eta(i)+1} \ldots Y_{\eta(i+1)} \right).$$

7. Let $0 < s \leq t \leq n$. We call the integer interval [s, t] = {s, s + 1, ..., t} a block
of Y, if for all $r \in [s, t]$ we have $Y_r = Y_s$ but $Y_{s-1} \neq Y_s$ and $Y_t \neq Y_{t+1}$. The
cardinality |[s, t]| = t - s + 1 is called the length of the block [s, t].

Let us give an illustrative example. Take $Z^6$ = 101011, n = 9 and Y = 111000111. Let
$(\pi, \eta)$ be defined as follows:

$$\pi(1) = 1,\ \pi(2) = 3,\ \pi(3) = 4,\ \pi(4) = 5,\ \pi(5) = 6$$

and

$$\eta(1) = 1,\ \eta(2) = 2,\ \eta(3) = 4,\ \eta(4) = 7,\ \eta(5) = 8.$$

Then, $(\pi, \eta)$ is a pair of matching subsequences of $Z^6$ and Y. The common subsequence
associated with it is:

$$Z^6_1 Z^6_3 Z^6_4 Z^6_5 Z^6_6 = Y_1 Y_2 Y_4 Y_7 Y_8 = 11011.$$

We represent the pair of matching subsequences $(\pi, \eta)$ using an alignment of $Z^6$ and Y:

101_0__11_
1_11000111

In this example $(\pi, \eta)$ contains the following four matches:

1.
    101
    1_1

2.
    1_0
    110

3.
    0__1
    0001

4.
    11
    11

The first match above is empty. The second match contains a one. Here, $Y_3$ is a free bit of
the second match. The third match contains two zeros: $Y_5$ and $Y_6$ are free bits of the third
match. The fourth match is empty. The common subsequence 11011 is of maximal length
(among all the common subsequences of $Z^6$ and Y). So, we have that $L^a(6) = 5$. Hence,
$L^a(7)$ can only be equal to 5 or 6.
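
The matches and free bits of this example can be listed mechanically. The following Python sketch uses 1-based index lists for the two index maps, as in the text; the helper names `is_matching` and `matches` are hypothetical.

```python
def is_matching(pi, eta, z, y):
    """Check that (pi, eta) is a pair of matching subsequences of z and y."""
    increasing = all(a < b for a, b in zip(pi, pi[1:])) and all(
        a < b for a, b in zip(eta, eta[1:])
    )
    return increasing and all(z[p - 1] == y[e - 1] for p, e in zip(pi, eta))


def matches(pi, eta, y):
    """Return the matches of (pi, eta), each with its string of free bits of y."""
    out = []
    for i in range(len(pi) - 1):
        # Free bits are the Y_j with eta(i) < j < eta(i+1) (1-based indices).
        free = y[eta[i]:eta[i + 1] - 1]
        out.append(((pi[i], pi[i + 1], eta[i], eta[i + 1]), free))
    return out


z6, y = "101011", "111000111"
pi = [1, 3, 4, 5, 6]
eta = [1, 2, 4, 7, 8]

print(is_matching(pi, eta, z6, y))
for quadruple, free in matches(pi, eta, y):
    print(quadruple, repr(free), "non-empty" if free else "empty")
```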

What is the probability that $L^a(7)$ is larger by one than $L^a(6)$? When we generate $Z^7$ by
dropping the bit $V_7$ on $Z^6$, then there are five positions where it can fall:

    position 1    position 2    position 3    position 4    position 5
    1x01011       10x1011       101x011       1010x11       10101x1

where x designates the possible positions of the bit $V_7$. Each of these positions has the same
probability. Positions 1 and 2 correspond to the first match. Position 3 corresponds to
the second match. Position 4 corresponds to the third match and position 5 corresponds to
match number four.

If $V_7 = 1$ and the bit drops on the match which contains a one (that is match number
two corresponding to position three, i.e. $T_7 = 3$), then $L^a(7) = L^a(6) + 1$. The reason is
that the bit $V_7$ can then get matched with the free 1-bit in match two and increase the
score $L^a(6)$ by one. Similarly, if $V_7 = 0$ and the bit $V_7$ drops on match number three,
the score gets increased by one, since then $V_7$ gets matched with the "free" zero contained
in match number three. Hence, when $V_7 = 0$ drops on match number three, the result is:
$L^a(7) = L^a(6) + 1$. In general $L^a(k + 1) = L^a(k) + 1$, if the bit $V_{k+1}$ drops on a match
which contains a bit of the same color as $V_{k+1}$. (By color, we mean 0 or 1.)

From the idea of the previous example, we can get a lower bound for the probability
that the score $L^a(k)$ increases by one. The bit $V_{k+1}$ is equally likely to be equal
to one or equal to zero. So, when it drops on a nonempty match, the score has at
least 50% probability to increase. Each nonempty match corresponds to at least
one position. The bit $V_{k+1}$ has k - 1 equally likely positions. It follows that for any
pair $(\pi, \eta)$ of matching subsequences of $Z^k$ and Y:

$$P\left( L^a(k + 1) = L^a(k) + 1 \mid Z^k, Y \right) \geq \frac{1}{2} \cdot \frac{\#\ \text{of nonempty matches of } (\pi, \eta)}{k - 1} \tag{4.4}$$

if $(\pi, \eta)$ is of maximal length.
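
For the running example, the exact conditional probability of an increase can be compared with this lower bound by enumerating all positions and both values of the dropped bit. The sketch below is illustrative only; the count of two non-empty matches is read off from the example pair discussed above.

```python
from fractions import Fraction


def lcs_length(x, y):
    """Length of the longest common subsequence of x and y (row-by-row DP)."""
    prev = [0] * (len(y) + 1)
    for cx in x:
        cur = [0]
        for j, cy in enumerate(y, start=1):
            cur.append(prev[j - 1] + 1 if cx == cy else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


z6, y = "101011", "111000111"
k = len(z6)
base = lcs_length(z6, y)

increases = 0
for t in range(1, k):          # the new bit lands between z6[t-1] and z6[t]
    for bit in "01":           # both values of the dropped bit are equally likely
        z7 = z6[:t] + bit + z6[t:]
        if lcs_length(z7, y) == base + 1:
            increases += 1

exact = Fraction(increases, 2 * (k - 1))
nonempty_matches = 2           # matches two and three of the example pair
bound = Fraction(1, 2) * Fraction(nonempty_matches, k - 1)
print(base, exact, bound, exact >= bound)
```

The bound only counts increases produced by free bits inside non-empty matches, so the exact probability may be considerably larger.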
Let us explain at this stage the main ideas for the proof of Lemma 4.1. We
distinguish two cases depending on the value of k.

We first deal with the case $k \leq 0.45n$. In this case it is easy to show that with large
probability all the bits in $Z^k$ are matched. Let $E^n_{1k}$ be the event:

$$E^n_{1k} := \{ L^a_n(k) = k \} \tag{4.5}$$

and

$$E^n_1 := \bigcap_{k=1}^{0.45n} E^n_{1k}. \tag{4.6}$$

Observe that we have

$$E^n_1 = \{ L^a_n(k + 1) - L^a_n(k) = 1,\ \forall k < 0.45n \} \tag{4.7}$$

i.e. the slope of $L^a_n(k)$ is equal to 1 for all k < 0.45n if $E^n_1$ holds. In the next section
we prove the following lemma:

Lemma 4.2 We have

$$\lim_{n \to \infty} P(E^n_1) = 1. \tag{4.8}$$


Assume that instead of looking for an LCS, we want to know if one sequence is contained in
another. For example, for given $l \in \mathbb{N}$, we may be interested in finding out if the sequence
$Z^k$ is a subsequence of $Y_1 Y_2 \ldots Y_l$. For this let $\nu(i)$ be the smallest l such that $Z^k_1 Z^k_2 \ldots Z^k_i$ is
a subsequence of $Y_1 Y_2 \ldots Y_l$. Then, $\nu(1), \nu(2), \nu(3), \ldots$ defines a renewal process. The
interarrival times $I_i = \nu(i + 1) - \nu(i)$ have geometric distribution and expectation $E[I_i] = 2$.
Thus, $E[\nu(i)] = 2i$ and $\mathrm{VAR}[\nu(n)]$ is of order n. From this it follows that if we want $Z^k$ to be with
high probability a subsequence of $Y_1 Y_2 \ldots Y_l$ we need to take l somewhat above 2k. Let us
give a numerical example. Take $Z^3$ = 001 and Y = 10101000111. Then, $\nu(1)$ denotes the
index of the first $Y_i$ equal to zero. In this case, $\nu(1) = 2$. Similarly, $\nu(2)$ is the smallest
$i > \nu(1)$ such that $Y_i = Z^3_2 = 0$. Here: $\nu(2) = 4$. Finally, $\nu(3)$ is the smallest $i > \nu(2)$ such
that $Y_i = 1$, hence $\nu(3) = 5$.
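
The embedding times of this numerical example can be computed by a greedy left-to-right scan. The following Python sketch uses a hypothetical helper name, `embedding_times`, and returns the indices as 1-based positions in Y.

```python
def embedding_times(z: str, y: str) -> list[int]:
    """Greedy leftmost embedding of z into y.

    Entry i-1 of the result is the smallest l such that z[:i] is a
    subsequence of y[:l]. The list is shorter than z if z does not fit.
    """
    times = []
    pos = 0                      # number of letters of y consumed so far
    for letter in z:
        idx = y.find(letter, pos)
        if idx == -1:
            break
        pos = idx + 1
        times.append(pos)
    return times


print(embedding_times("001", "10101000111"))   # [2, 4, 5]
```

Each step waits for the next occurrence of a prescribed fair bit, which is why the increments are geometric with mean 2 when Y is i.i.d. Bernoulli with parameter 1/2.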



Let us next give the main ideas why, with high probability, the map $k \mapsto L^a(k)$
is increasing linearly on the domain [0.45n, n]. We use the bit-drop scheme to prove
this: we show that typically the random map $k \mapsto L^a(k)$ has a positive drift $\gamma > 0$.
We define:

$$E^n_{2k} := \left\{ \forall (\pi, \eta) \in M^k_{\min},\ \text{the number of nonempty matches of } (\pi, \eta) \text{ is larger than } \gamma n \right\}. \tag{4.9}$$

When $E^n_{2k}$ holds, every pair $(\pi, \eta) \in M^k_{\min}$ has at least $\gamma n$ non-empty matches. The
proportion of non-empty matches to k hence is larger or equal to $\gamma$. Using inequality
(4.4) it follows that

$$P\left( L^a(k + 1) = L^a(k) + 1 \mid Z^k, Y \right) \geq 0.5 \cdot \gamma \tag{4.10}$$

when $E^n_{2k}$ holds. Let $E^n_2$ be the event:

$$E^n_2 := \bigcap_{k=0.45n}^{n} E^n_{2k}. \tag{4.11}$$

Inequality (4.10) implies that when $E^n_2$ holds, the map $k \mapsto L^a(k)$ has positive drift
$0.5\gamma > 0$ for $k \in [0.45n, n]$. By large deviation it follows that with high probability
$k \mapsto L^a(k)$ has positive slope on [0.45n, n] as soon as $E^n_2$ holds. (See Lemma [?].)
It remains to explain why $E^n_2$ holds with high probability.
Let us first summarize the general idea:

We proceed by contradiction. Assume all the matches of $(\pi, \eta) \in M^k_{\min}$ were empty.
Then all of the following would hold:

• $(\eta(1), \eta(2), \eta(3), \ldots, \eta(m)) = (\eta(1), \eta(1) + 1, \eta(1) + 2, \ldots, \eta(1) + m - 1)$
where m is the length of the LCS of $Z^k$ and Y: $m = L^a(k)$.

• The sequence

$$Y_{\eta(1)} Y_{\eta(2)} \ldots Y_{\eta(m)} = Y_{\eta(1)} Y_{\eta(1)+1} \ldots Y_{\eta(1)+m-1}$$

is a subsequence of

$$Z^k_{\pi(1)} Z^k_{\pi(1)+1} \ldots Z^k_{\pi(m)}.$$

Hence we would have two independent i.i.d. sequences of Bernoulli variables with
parameter 1/2, where one is contained in the other as a subsequence. This implies
that the sequence containing the other must be approximately twice as long. Hence
k is approximately at least twice as large as $m = L^a(k)$. Thus, the ratio $L^a(k)/k$ is
close to 50% or below. This is very unlikely, since it is known that $L^a(k)/k$ is
typically above 80%. This is our contradiction.

From the previous argument it follows that with high probability any $(\pi, \eta) \in M^k_{\min}$
contains a non-vanishing proportion $\varepsilon > 0$ of free bits. (Hence, $1 - L^a_n(k)/\eta(L^a_n(k)) \geq \varepsilon$.)
We need to show that this proportion $\varepsilon$ of free bits generates sufficiently many
non-empty matches: the free bits should not be concentrated in too small a number of
matches.

Let us go back to the numerical example with $Z^6$ = 101011 and Y = 111000111 to illustrate
how we count the proportion of bits that are free. In that example, the first match of
$(\pi, \eta)$ contains no free bit. The second match contains one free bit which is a one. The
third match contains two free bits which are zeros. The fourth match contains no free bit.
The sequence Y contains a total of 8 bits which are involved in a match of $(\pi, \eta)$. (Note
that the last bit $Y_9$ of Y is not counted since it is not involved in a match of $(\pi, \eta)$.) We
have a proportion of free bits to bits involved in matches equal to:

$$\frac{3}{8} = \frac{8 - 5}{8} = \frac{\eta(L^a_n(k)) - L^a_n(k)}{\eta(L^a_n(k))}.$$

The 3 free bits generate two non-empty matches.

To prove that there are more than $\gamma n$ nonempty matches two arguments are used:

• Any pair of matching subsequences $(\pi, \eta)$ which is minimal according to our
partial order for pairs of matching subsequences satisfies:
every match of $(\pi, \eta)$ can contain zeros or ones but not both at the same
time. Hence, each match of $(\pi, \eta) \in M^k_{\min}$ contains free bits from at most one
block of Y.

• With high probability, the total number of integer points in [0, n] contained in
blocks of Y of length > D is very small. (By choosing D large, we make the
total number of points contained in blocks longer than D much smaller than
the number of free bits.)

From the two points above, it follows that for $(\pi, \eta) \in M^k_{\min}$, the majority of free bits
are at most D per match. This ensures that the proportion $\varepsilon$ of free bits generates
a proportion of at least order $\varepsilon/D$ non-empty matches.
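
The block structure used in this counting argument is easy to compute. In the sketch below, runs touching either end of the string are treated as blocks, which is a simplifying assumption of ours; the helper names are hypothetical.

```python
from itertools import groupby


def blocks(y: str) -> list[tuple[int, int]]:
    """Return the maximal constant runs of y as 1-based inclusive intervals (s, t)."""
    result = []
    start = 1
    for _, run in groupby(y):
        length = len(list(run))
        result.append((start, start + length - 1))
        start += length
    return result


def points_in_long_blocks(y: str, d: int) -> int:
    """Count the indices of y that lie in blocks of length greater than d."""
    return sum(t - s + 1 for s, t in blocks(y) if t - s + 1 > d)


y = "111000111"
print(blocks(y))
print(points_in_long_blocks(y, 2), points_in_long_blocks(y, 3))
```

For a long sequence of fair bits the second function, divided by the length of the sequence, gives the empirical fraction of points in blocks longer than D.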

Let us look at an example of a pair $(\pi, \eta)$ which is of maximal length but not minimal
according to our order relation on $M^k_{\max}$. Take $Z^7$ = 0101101 and Y = 00110010111. Define
the pair of matching subsequences $(\pi, \eta)$ as follows:

$$\pi(1) = 1,\ \pi(2) = 2,\ \pi(3) = 3,\ \pi(4) = 4,\ \pi(5) = 5,\ \pi(6) = 7$$

and

$$\eta(1) = 1,\ \eta(2) = 7,\ \eta(3) = 8,\ \eta(4) = 9,\ \eta(5) = 10,\ \eta(6) = 11.$$

Let us represent this pair of matching subsequences by an alignment:

0_____101101
0011001011_1

This gives the common subsequence 010111. The pair $(\pi, \eta)$ is of maximal length, but it
is not minimal for our order relation on $M^k_{\max}$: instead of $\eta(2) = 7$, take $\eta^*(2) = 3$. Let
otherwise $\eta^*$ be equal to $\eta$. Then $(\pi, \eta^*)$ is strictly below $(\pi, \eta)$. To construct $\eta^*$ we used
the fact that a match of $(\pi, \eta)$ contained both zeros and ones. It is always possible to find
a strictly smaller pair $(\pi, \eta^*) \in M^k_{\max}$ when a match of $(\pi, \eta)$ contains zeros and ones at the
same time.

Note that $(\pi, \eta)$ contains 5 free bits, but only one non-empty match. All the free bits of
$(\pi, \eta)$ are concentrated in one match. The match containing all the free bits contains several
blocks. By taking a minimal pair of matching subsequences, this kind of situation is avoided.
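
The reduction step of this example can be verified directly. The following Python sketch checks that both index maps give matching subsequences, that the modified map is componentwise smaller, and which matches still contain both zeros and ones; the helper names are hypothetical.

```python
def is_matching(pi, eta, z, y):
    """Check that (pi, eta) is a pair of matching subsequences of z and y."""
    increasing = all(a < b for a, b in zip(pi, pi[1:])) and all(
        a < b for a, b in zip(eta, eta[1:])
    )
    return increasing and all(z[p - 1] == y[e - 1] for p, e in zip(pi, eta))


def free_bits(eta, y):
    """Free bits of y inside each match, for 1-based Y-indices eta."""
    return [y[eta[i]:eta[i + 1] - 1] for i in range(len(eta) - 1)]


z7, y = "0101101", "00110010111"
pi = [1, 2, 3, 4, 5, 7]
eta = [1, 7, 8, 9, 10, 11]
eta_star = [1, 3, 8, 9, 10, 11]      # eta with eta(2) = 7 replaced by 3

for name, e in (("eta", eta), ("eta*", eta_star)):
    free = free_bits(e, y)
    mixed = [w for w in free if "0" in w and "1" in w]
    print(name, is_matching(pi, e, z7, y), free, len(mixed))

print(all(a <= b for a, b in zip(eta_star, eta)))
```

In this sketch the modified pair is strictly smaller but still has a match containing both zeros and ones, so further reductions of the same kind remain possible before a minimal pair is reached.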

Let us look at the details of the proof of Lemma 4.1. Let $L^l(k)$ denote the length
of the LCS of $Z^k$ and the sequence $Y^l := Y_1 Y_2 \ldots Y_l$. For $Y^l$ to be entirely contained
as a subsequence in $Z^k$, one needs k to be approximately twice as long as l. (We
have that $Y^l$ is a subsequence of $Z^k$ iff $L^l(k) = l$.) Hence, it is unlikely that $Y^l$
is a subsequence of $Z^k$, when $k = 2l(1 - \delta)$. (Here $\delta > 0$ is a constant not depending
on l.) In other words, it is unlikely that:

$$L^l(2l(1 - \delta)) \geq l.$$

Similarly, it is unlikely that $Y^l$ is "close to being a subsequence of $Z^k$", when
$k = 2l(1 - \delta)$:

Lemma 4.3 There exists a function $\delta : \mathbb{R} \to \mathbb{R}$ such that $\lim_{\varepsilon \to 0} \delta(\varepsilon) = 0$ and

$$P\left( L^l(2l(1 - \delta(\varepsilon))) \geq l(1 - \varepsilon) \right) \leq C e^{-cl} \tag{4.12}$$

for all l > 0 and suitable constants c > 0 and C > 0 not depending on l. (Note that
the constants c > 0 and C > 0 may depend on $\varepsilon$.)

We can now define:

$$E^l_3 := \left\{L^a_l(2l(1-\delta(\epsilon))) < (1-\epsilon)l\right\} \qquad (4.13)$$

and

$$E^n_3 := \bigcap_{k=0.2n}^{n} E^k_3 \qquad (4.14)$$

where $\epsilon$ is a suitable number, to be fixed in the following, and $\delta(\epsilon)$ is given by
Lemma 4.3. It follows that:

Corollary 4.1 If $\delta(\epsilon)$ in the definition of $E^n_3$ is given by Lemma 4.3, we have

$$\lim_{n \to \infty} P(E^n_3) = 1. \qquad (4.15)$$

Typically, $L^a_n(k)$ is above $80\% \cdot k$. However, to make things easier, we prove only
that it is above $65\% \cdot k$. We define:

$$E^n_{4,k} := \left\{L^a_n(k) \ge 0.65k\right\} \qquad (4.16)$$

and

$$E^n_4 := \bigcap_{k=0.45n}^{n} E^n_{4,k}. \qquad (4.17)$$

The next lemma is proven in the next section:

Lemma 4.4 We have

$$\lim_{n \to \infty} P(E^n_4) = 1. \qquad (4.18)$$

Let us define the event $E^n_{5,k}$:

$$E^n_{5,k} := \left\{L^a_n(k) \le (1-\epsilon)\,\eta(L^a_n(k)),\ \forall (\pi,\eta) \in M^k\right\} \qquad (4.19)$$

and

$$E^n_5 := \bigcap_{k=0.45n}^{n} E^n_{5,k}. \qquad (4.20)$$

The event $E^n_{5,k}$ says that any pair of matching subsequences $(\pi,\eta) \in M^k$ has a
proportion of at least $\epsilon$ free bits. (Note that $\eta(L^a_n(k))$ is the number of the last bit
of $Y$ involved in a match of $(\pi,\eta)$. Furthermore, $L^a_n(k)$ represents the number of
bits that are "matched" by $(\pi,\eta)$. Hence, $\eta(L^a_n(k)) - L^a_n(k)$ is the number of "free"
bits.)
Lemma 4.5 Take $\epsilon > 0$ small enough, so that

$$\frac{0.5}{1-\delta(\epsilon)} < 65\%. \qquad (4.21)$$

Then we have that, for all $k \ge 0.45n$:

$$E^n_3 \cap E^n_{4,k} \subset E^n_{5,k}. \qquad (4.22)$$

Thus

$$E^n_3 \cap E^n_4 \subset E^n_5. \qquad (4.23)$$

Proof. Let $k \in [0.45n, n]$. We show that if $E^n_{5,k}$ does not hold and $E^n_3$ holds,
then $E^n_{4,k}$ cannot hold. This in turn implies (4.22).

If $E^n_{5,k}$ does not hold, then there is a pair $(\pi,\eta) \in M^k$ whose proportion of "free" bits
is below $\epsilon$. In other words:

$$\frac{L^a_l(k)}{l} \ge 1-\epsilon$$

where $l := \eta(L^a_n(k))$. (Note that $L^a_l(k) = L^a_n(k)$, since $(\pi,\eta)$ is of maximal length.)
It follows that

$$L^a_l(k) \ge l(1-\epsilon). \qquad (4.24)$$

Now, when $E^n_3$ holds, then

$$L^a_l(2l(1-\delta(\epsilon))) < l(1-\epsilon). \qquad (4.25)$$

Comparing inequality (4.24) with (4.25) and noting that the (random) map $x \mapsto L^a_l(x)$
is increasing, yields:

$$k > 2l(1-\delta(\epsilon))$$

and hence

$$k > 2\eta(L^a_n(k))(1-\delta(\epsilon)) \ge 2L^a_n(k)(1-\delta(\epsilon)).$$

From this it follows that:

$$\frac{L^a_n(k)}{k} < \frac{0.5}{1-\delta(\epsilon)} < 65\% \qquad (4.26)$$

where the 65%-bound is obtained from inequality (4.21). Inequality (4.26) contradicts
$E^n_{4,k}$. ■

To obtain $E^n_2$ we must be sure that the free bits of $Y$ do not concentrate in a
small number of matches of $(\pi,\eta) \in M^k$. As explained in the example on page
12, any match of $(\pi,\eta) \in M^k$ can contain 0's or 1's (or nothing), but not 0's and 1's
at the same time. This is due to the minimality with respect to the ordering $<$. In fact,
if $(\pi(i), \pi(i+1), \eta(i), \eta(i+1))$ is a non-empty match we must have that $Y_l \ne Y_{\eta(i+1)}$
for all $\eta(i) < l < \eta(i+1)$. Otherwise, we could match the bit $Z_{\pi(i+1)}$ with $Y_l$ instead
of $Y_{\eta(i+1)}$. This modification would yield a pair of matching subsequences of the same
length but strictly smaller according to our order relation on $M^k$. Thus, all the free
bits of a match of $(\pi,\eta) \in M^k$ are contained in only one block of $Y$.
It is useful to see how many bits are contained in long blocks. Let $BLOCK_D$
designate the set of all blocks $[i,j] \subset [0,n]$ of $Y$ of length at least $D$. (For the
definition of blocks see the definitions at the beginning of this section.) Let $N_D$
denote the total number of points in the sequence $Y$ which are contained in a block
of length at least $D$:

$$N_D := \left|\left\{s \in [1,n] \mid \exists [i,j] \in BLOCK_D,\ s \in [i,j]\right\}\right|. \qquad (4.27)$$
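
The count $N_D$ is easy to compute for a concrete sequence. The sketch below assumes that a block is a maximal run of equal consecutive letters of $Y$ (the formal definition is given at the beginning of the section); the function name `points_in_long_blocks` is illustrative.

```python
from itertools import groupby


def points_in_long_blocks(y, d):
    """Count the positions of y lying in a maximal run of equal letters of length >= d."""
    # groupby splits y into maximal runs of equal consecutive letters.
    run_lengths = (len(list(run)) for _, run in groupby(y))
    return sum(length for length in run_lengths if length >= d)
```

For example, the sequence `"0001101111"` has runs of lengths 3, 2, 1 and 4, so for `d = 3` the count is 3 + 4 = 7.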

Let $E^n_6$ designate the event:

$$E^n_6 := \left\{N_D \le \epsilon n/4\right\} \qquad (4.28)$$

We will show in sec. 5 that:

Lemma 4.6 For every $\epsilon$ there exists $D$ such that

$$\lim_{n \to \infty} P(E^n_6) = 1. \qquad (4.29)$$



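We then have the following combinatorial fact:

Lemma 4.7 We have that, for all $k \ge 0.45n$:

$$E^n_4 \cap E^n_6 \cap E^n_{5,k} \subset E^n_{2,k} \qquad (4.30)$$

with $\gamma = 0.0425\,\epsilon/(D-1)$. Thus also:

$$E^n_4 \cap E^n_6 \cap E^n_5 \subset E^n_2. \qquad (4.31)$$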

Proof. We prove (4.30). The event $E^n_{5,k}$ implies that for each $(\pi,\eta) \in M^k$ there are
at least $\epsilon\,\eta(L^a_n(k))$ free bits. We have:

$$\eta(L^a_n(k)) \ge L^a_n(k). \qquad (4.32)$$

When $E^n_4$ holds, we have that:

$$L^a_n(k) \ge 0.65k. \qquad (4.33)$$

Since we take $k \ge 0.45n$, inequalities (4.32) and (4.33) together imply that the number
of free bits of $(\pi,\eta) \in M^k$ is at least

$$\epsilon \cdot 0.65 \cdot 0.45\,n = 0.2925\,\epsilon n.$$

By $E^n_6$, there are at most $0.25\,\epsilon n$ bits contained in blocks of length $\ge D$. Thus, there
are at least $0.0425\,\epsilon n$ free bits contained in blocks of length $< D$. Recall that every
match of $(\pi,\eta) \in M^k$ contains free bits from only one block. Hence, every match of
$(\pi,\eta) \in M^k$ can contain at most $D-1$ free bits from blocks of length $< D$. Hence,
these $0.0425\,\epsilon n$ free bits which are not counted in $N_D$ must fill at least $0.0425\,\epsilon n/(D-1)$
matches of $(\pi,\eta) \in M^k$. It follows that $(\pi,\eta) \in M^k$ has at least $0.0425\,\epsilon n/(D-1)$
non-empty matches. ■
Lemmas 4.5 and 4.7 jointly imply that $E^n_3 \cap E^n_4 \cap E^n_6 \subset E^n_2$. Hence:

$$P(E^{nc}_2) \le P(E^{nc}_3) + P(E^{nc}_4) + P(E^{nc}_6) \qquad (4.34)$$

where $E^{nc}_2$ denotes the complement of $E^n_2$. We have that $P(E^{nc}_3)$, $P(E^{nc}_4)$ and
$P(E^{nc}_6)$ all converge to zero when $n \to \infty$. (This follows from Corollary 4.1 and
Lemmas 4.4 and 4.6.) Hence, we have that:

$$\lim_{n \to \infty} P(E^n_2) = 1. \qquad (4.35)$$

Let $\sigma_k$ denote the $\sigma$-algebra:

$$\sigma_k := \sigma\left(Z_i, Y_j \mid i \le k,\ j \le n\right).$$

It is easy to check that $E^n_{2,k}$ is $\sigma_k$-measurable. Note that $L^a_n(k+1) - L^a_n(k)$ is always
equal to one or zero.

Lemma 4.8 When $E^n_{2,k}$ holds, then

$$P\left(L^a_n(k+1) - L^a_n(k) = 1 \mid \sigma_k\right) \ge 0.5\gamma. \qquad (4.36)$$

Proof. This has already been explained. (See inequality (4.10).) ■
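We finally observe that

$$P(E^{nc}_{slope}) \le P\left(E^{nc}_{slope} \cap (E^n_1 \cap E^n_2)\right) + P(E^{nc}_1) + P(E^{nc}_2). \qquad (4.37)$$

Since $P(E^{nc}_1)$ and $P(E^{nc}_2)$ both go to zero as $n$ goes to infinity, we only need to
prove that

$$P\left(E^{nc}_{slope} \cap (E^n_1 \cap E^n_2)\right) \to 0 \quad \text{for } n \to \infty, \qquad (4.38)$$

to establish Lemma 4.1.

Lemma 4.9 We have that

$$P\left(E^{nc}_{slope} \cap (E^n_1 \cap E^n_2)\right) \to 0$$

as $n \to \infty$.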

Proof. We can assume that $\gamma < 1$. Define $k_1 := 0.4\gamma$, so that $k_1 < 0.4$. Let

$$\Delta(k) := L^a_n(k+1) - L^a_n(k)$$

when $E^n_{2,k}$ holds, and $\Delta(k) := 1$ otherwise. From eq. (4.36), it follows that:

$$P\left(\Delta(k) = 1 \mid \sigma_k\right) \ge 0.5\gamma. \qquad (4.39)$$

Furthermore, $\Delta(k)$ is equal to zero or one and $\sigma_{k+1}$-measurable. For $k \in\ ]0.45n, n]$, let

$$\bar{L}^a_n(k) := L^a_n(0.45n) + \sum_{i=0.45n}^{k-1} \Delta(i).$$
For $k \in [0, 0.45n]$, let $\bar{L}^a_n(k) := L^a_n(k)$. Note that when $E^n_2$ holds, then

$$\bar{L}^a_n(k) = L^a_n(k) \qquad (4.40)$$

for all $k \in [0, n-1]$. Introduce the event $\bar{E}^n_{slope}$ to be the event such that $\forall i,j$ with
$0.45n \le i < j \le n$ and $i + k_2 \ln n \le j$, we have:

$$\bar{L}^a_n(j) - \bar{L}^a_n(i) \ge k_1 |i-j|. \qquad (4.41)$$

When $E^n_1$ holds, then $L^a_n(k)$ has a slope of one on the domain $[0, 0.45n]$. Hence,
the slope condition of $E^n_{slope}$ holds on the domain $[0, 0.45n]$, since we have $k_1 < 0.4$.
When $E^n_2$ holds, then $L^a_n(k)$ and $\bar{L}^a_n(k)$ are equal. It follows that when $E^n_2$ and $\bar{E}^n_{slope}$
both hold, then the slope condition of $E^n_{slope}$ is verified on the domain $[0.45n, n]$.
Hence

$$E^n_1 \cap E^n_2 \cap E^n_{slope} = E^n_1 \cap E^n_2 \cap \bar{E}^n_{slope}. \qquad (4.42)$$

Thus

$$P\left(E^{nc}_{slope} \cap E^n_1 \cap E^n_2\right) = P\left(\bar{E}^{nc}_{slope} \cap E^n_1 \cap E^n_2\right) \le P\left(\bar{E}^{nc}_{slope}\right).$$

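It only remains to prove that $P(\bar{E}^{nc}_{slope})$ goes to zero as $n \to \infty$. For this we can use
large deviation. Let $\bar{E}^n_{ij}$ be the event that

$$\bar{L}^a_n(j) - \bar{L}^a_n(i) \ge k_1 |i-j|.$$

Then

$$\bar{E}^n_{slope} = \bigcap_{i,j} \bar{E}^n_{ij}$$

where the intersection in the last equation above is taken over all $i,j \in [0.45n, n]$
such that $i + k_2 \ln n \le j$. It follows that

$$P\left(\bar{E}^{nc}_{slope}\right) \le \sum_{i,j} P\left(\bar{E}^{nc}_{ij}\right) \qquad (4.43)$$

where the last sum is taken over all $i,j \in [0.45n, n]$ such that $i + k_2 \ln n \le j$. Since
we took $k_1 = 0.4\gamma$ and because of (4.39), large deviation tells us that there exist
constants $c, C > 0$ such that

$$P\left(\bar{E}^{nc}_{ij}\right) \le C e^{-c|i-j|} \qquad (4.44)$$

for all $i,j \in \mathbb{N}$. (The constants $C, c$ do not depend on $i,j$.) Take $k_2 := 3/c$. With
this choice, (4.44) becomes:

$$P\left(\bar{E}^{nc}_{ij}\right) \le C n^{-3} \qquad (4.45)$$

when $k_2 \ln n \le |i-j|$. Note that there are less than $n^2$ terms in the sum in inequality
(4.43). By (4.45), each term in the sum in inequality (4.43) is less than or equal to $C n^{-3}$.
Thus inequalities (4.43) and (4.45) together imply that

$$P\left(\bar{E}^{nc}_{slope}\right) \le n^2 \cdot C n^{-3} = \frac{C}{n}.$$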
5 Bounds for the probabilities.



We report in this section several proofs of the lemmas used in sec. 4. 

Lemma 5.1 For every $n$ and $v < 0.5$ we have

$$P\left(L^a_n(vn) = vn\right) \ge 1 - e^{-c(0.5-v)^2 n} \qquad (5.1)$$

Proof. We can build a pair of matching subsequences as follows: start from $Z_1$
and match it with the first $Y_{i_1} = Z_1$, then match $Z_2$ with the first $Y_{i_2} = Z_2$ such
that $i_2 > i_1$. We can proceed as before until we reach the end of $Z^k$ or of
$Y$. More precisely, we can define a matching $(\pi,\eta)$ such that $\pi(i) = i$ and
$\eta(i) = \inf\{l > \eta(i-1) : Y_l = Z_i\}$ (see the remark after Lemma 4.2 for an explicit example). Given
$Z^k$ and $Y$ we call $T_j$ the sequence of random variables defined by $T_j = \eta(j) - \eta(j-1)$.
Observe that the $T_j$ form a sequence of independent random variables, all with geometric
distribution of parameter $\frac{1}{2}$. It follows that



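$$P\left(L^a_n(vn) = vn\right) = P\left(\sum_{i=1}^{vn} T_i \le n\right) = 1 - P\left(\sum_{i=1}^{vn} T_i > n\right) \qquad (5.2)$$

but

$$P\left(\sum_{i=1}^{vn} T_i > n\right) = P\left(\sum_{i=1}^{vn} T_i - n > 0\right) \qquad (5.3)$$

$$\le \inf_{s>0} E\left(e^{s\left(\sum_{i=1}^{vn} T_i - n\right)}\right) = \inf_{s>0} e^{-ns}\, E\left(e^{s\sum_{i=1}^{vn} T_i}\right). \qquad (5.4)$$

Due to the independence of the $T_i$ we have

$$E\left(e^{s\sum_{i=1}^{vn} T_i}\right) = E\left[e^{sT_1}\right]^{vn} = \left(\frac{e^s}{2-e^s}\right)^{vn}.$$

It is easy to check that

$$\inf_{s>0} \left(\frac{e^s}{2-e^s}\right)^{v} e^{-s} \le e^{-c(0.5-v)^2} \qquad (5.5)$$

for a suitable constant $c$, so that we get

$$P\left(L^a_n(vn) = vn\right) \ge 1 - e^{-c(v-0.5)^2 n}. \qquad (5.6)$$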
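
The infimum in (5.5) can be evaluated in closed form. The following worked calculation is a sketch added for illustration, not taken from the paper: it assumes the moment generating function $E(e^{sT_1}) = e^s/(2-e^s)$ of a geometric variable of parameter $\frac{1}{2}$ (valid for $0 < s < \ln 2$), writes $H(v)$ for the binary entropy in nats, and uses Pinsker's inequality in the last step.

```latex
\begin{align*}
f(s) &:= v \ln\frac{e^{s}}{2-e^{s}} - s, \qquad 0 < s < \ln 2, \\
f'(s) &= \frac{2v}{2-e^{s}} - 1 = 0
  \iff e^{s^*} = 2(1-v), \quad\text{and } s^* > 0 \iff v < 0.5, \\
f(s^*) &= v\ln\frac{1-v}{v} - \ln\bigl(2(1-v)\bigr)
  = -v\ln v - (1-v)\ln(1-v) - \ln 2 = H(v) - \ln 2, \\
\inf_{s>0}\left(\frac{e^{s}}{2-e^{s}}\right)^{v} e^{-s}
  &= e^{-(\ln 2 - H(v))} \le e^{-2(0.5-v)^2}.
\end{align*}
```

Under these assumptions the constant $c = 2$ would be admissible. As a numerical check, for $v = 0.25$ one gets $e^{s^*} = 1.5$ and $f(s^*) \approx -0.131$, while $-2(0.5-v)^2 = -0.125$.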



Proof of Lemma 4.2. It follows immediately from the above lemma. ■
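
The greedy construction in the proof of Lemma 5.1 can be written out directly. The sketch below is illustrative only: the names `greedy_match` and `waiting_times` are not from the paper, positions are 1-based, and the convention $\eta(0) = 0$ is an assumption made so that $T_1$ is defined.

```python
def greedy_match(z, y):
    """Match z[0], z[1], ... to the earliest possible positions of y.

    Returns the 1-based positions eta(1) < eta(2) < ... in y, stopping at
    the first letter of z that can no longer be matched.
    """
    eta = []
    pos = 0  # number of letters of y already consumed
    for letter in z:
        # Skip letters of y until the next occurrence of `letter`.
        while pos < len(y) and y[pos] != letter:
            pos += 1
        if pos == len(y):
            break  # y is exhausted: the remaining letters of z stay unmatched
        pos += 1
        eta.append(pos)
    return eta


def waiting_times(eta):
    """T_j = eta(j) - eta(j-1), with the assumed convention eta(0) = 0."""
    return [b - a for a, b in zip([0] + eta, eta)]
```

The whole of `z` is a subsequence of `y` exactly when `len(greedy_match(z, y)) == len(z)`, which is the event whose probability is bounded in (5.1). For independent uniform binary sequences the waiting times average about 2, so one expects the greedy matching to succeed when `len(z)` is noticeably below `len(y) / 2`.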
In a very similar way we can prove that 

Lemma 5.2 For every $k$

$$P\left(L^a_k(2(1-\delta)k) = k\right) \le C e^{-c\delta^2 k} \qquad (5.7)$$

Proof. Observe that the only possibility for $L^a_{(2-\delta)k}(k) = k$ is that the pair of matching
subsequences constructed at the beginning of the proof of Lemma 5.1 has length $k$.
Using the notation of that proof we have that

$$P\left(L^a_{(2-\delta)k}(k) = k\right) = P\left(\sum_{i=1}^{k} T_i \le (2-\delta)k\right) \qquad (5.8)$$

This quantity can be evaluated as in the previous proof to obtain the lemma. ■
We can now estimate the probability of $E^k_3$.

Proof of Lemma 4.3. Consider a subset $S \subset [0,l]$ containing $(1-\epsilon)l$ points. There are
$\binom{l}{(1-\epsilon)l}$ such subsets. We can fix the sequence $Y$ on the subset $S$. We have
$2^{\epsilon l}$ $Y$'s that agree on $S$. Calling $\delta(\epsilon) = \epsilon + \delta'(\epsilon)$ we have, due to Lemma 5.2,
that the probability of matching all $Y$ in $S$ is bounded by $e^{-c\delta'(\epsilon)^2 l}$. Collecting the
above estimates we get that

$$P\left(L^a_l(2l(1-\delta(\epsilon))) \ge l(1-\epsilon)\right) \le 2^{\epsilon l} \binom{l}{(1-\epsilon)l} e^{-c\delta'(\epsilon)^2 l} \le C' e^{\left[\epsilon(\ln 2 - \ln\epsilon) - (1-\epsilon)\ln(1-\epsilon) - c\delta'(\epsilon)^2\right] l} \qquad (5.9)$$

where we have used Stirling's formula. Thus it is enough to choose




(5.10) 



to obtain the lemma. ■ 

Proof of Lemma 4.4. We can divide the sequences $Z^k$ and $Y$ into subsequences
of length 10 and write $L^a_n(k) \ge \sum_{i=1}^{k/10} L_i$, where $L_i$ is the length of the longest common
subsequence of $Y_{10(i-1)+1} \ldots Y_{10i}$ and $Z_{10(i-1)+1} \ldots Z_{10i}$. From Chvátal we know
that $E(L_i) = 6.97844$. From a standard large deviation argument we get

$$P\left(\sum_{i=1}^{k/10} L_i \le \frac{k}{10}(6.9-\delta)\right) \le \left(\inf_{s<0} E\left(e^{s(L_1-(6.9-\delta))}\right)\right)^{k/10} \qquad (5.11)$$

Calling $p(s,\delta) = E\left(e^{s(L_1-(6.9-\delta))}\right)$, it is easy to see that $p(s,\delta)$ is smooth in $s$, $p(0,\delta) = 1$
and $\partial_s p(0,\delta) > 0$ for every $\delta > 0$. This implies that

$$\inf_{s<0} p(s,\delta) \le e^{-c(\delta)} \qquad (5.12)$$

for suitable $c(\delta) > 0$. This immediately gives the statement of the lemma.
■

Finally we prove Lemma 4.6.

Proof of Lemma 4.6. Let $\tilde{N}_D$ be the number
of integer points in $[0, n-D]$ which are followed by at least $D$ times the same color
in the sequence $Y$. Thus, $\tilde{N}_D$ is the number of integer points $s \in [0, n-D]$ so that

$$Y_s = Y_{s+1} = \ldots = Y_{s+D}. \qquad (5.13)$$

It is easy to check that

$$N_D \le D \tilde{N}_D. \qquad (5.14)$$

Let now $\tilde{Y}_s$, $s \in [0, n-D]$, be equal to 1 iff (5.13) holds, and 0 otherwise. We find:

$$\sum_{s=1}^{n} \tilde{Y}_s = \tilde{N}_D. \qquad (5.15)$$

To estimate the sum (5.15) we can decompose it into $D$ sub-sums $S_1, S_2, \ldots, S_D$
where

$$S_i := \sum_{\substack{s = 1, \ldots, n \\ s \bmod D = i}} \tilde{Y}_s \qquad (5.16)$$

so that

$$\tilde{N}_D = \sum_{i=1}^{D} S_i. \qquad (5.17)$$

It is easy to see that

$$P\left(N_D > \frac{\epsilon}{4} n\right) \le P\left(\tilde{N}_D > \frac{\epsilon}{4D} n\right) \le D \cdot P\left(S_i > \frac{\epsilon}{4D^2} n\right) \qquad (5.18)$$

where the last inequality follows from the fact that at least one of the addends in
(5.17) has to be larger than $\frac{\epsilon}{4D^2} n$. Now, the $\tilde{Y}_s$ appearing in the sub-sum $S_i$ are i.i.d.
Bernoulli random variables with $P(\tilde{Y}_s = 1) = 2^{-D}$. We can apply a large deviation
argument analogous to the one used in the previous proof and obtain

$$P\left(S_i > (2^{-D} + \delta)\frac{n}{D}\right) \le e^{-c(\delta)\frac{n}{D}} \qquad (5.19)$$

with $c(\delta) > 0$ for $\delta > 0$. Thus it is enough to choose $D$ such that $D 2^{-D} < \frac{\epsilon}{4}$. ■
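
To get a feeling for the size of $D$, the closing condition can be solved numerically. The sketch below reads that condition as $D 2^{-D} < \epsilon/4$, which is what comparing the thresholds in (5.18) and (5.19) requires; the function name `smallest_block_length` is illustrative.

```python
def smallest_block_length(eps):
    """Smallest integer D >= 1 with D * 2**(-D) < eps / 4."""
    if eps <= 0:
        raise ValueError("eps must be positive")
    d = 1
    # D * 2**(-D) tends to zero, so the loop terminates for every eps > 0.
    while d * 2.0 ** (-d) >= eps / 4:
        d += 1
    return d
```

Since $D 2^{-D}$ is decreasing for $D \ge 2$, every larger $D$ satisfies the condition as well. For instance, $\epsilon = 0.1$ gives $D = 9$, because $8 \cdot 2^{-8} \approx 0.031$ is still above $0.025$ while $9 \cdot 2^{-9} \approx 0.018$ is below it.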